ABSTRACT

Recent advances in Automatic Speech Recognition technology have
put the goal of natural-sounding dialog systems within reach.
However, the improved speech recognition has brought to light
a new problem: as dialog systems understand more of what the
user tells them, they need to be more sophisticated at responding
to the user. The issue of system response to users has been
extensively studied by the natural language generation community,
though rarely in the context of dialog systems. We show how
research in generation can be adapted to dialog systems, and how
the high cost of hand-crafting knowledge-based generation systems
can be overcome by employing machine learning techniques.

1.  DIALOG  SYSTEMS  AND  GENERATION 

Recent advances in Automatic Speech Recognition (ASR) technology
have put the goal of natural-sounding dialog systems within
reach.* However, the improved ASR has brought to light a new
problem: as dialog systems understand more of what the user tells
them, they need to be more sophisticated at responding to the user.

If ASR is limited in quality, dialog systems typically employ a
system-initiative dialog strategy in which the dialog system prompts
the user for specific information and then presents some information
to the user. In this paradigm, the range of user input at any time
is limited (thus facilitating ASR), and the range of system output at
any time is also limited. However, such interactions are not very
natural. In a more natural interaction, the user can supply more and
different information at any time in the dialog. The dialog system
must then support a mixed-initiative dialog strategy. While this
strategy places greater requirements on ASR, it also increases the
range of system responses and the requirements on their quality in
terms of informativeness and of adaptation to the context.

For a long time, the issue of system response to users has been
studied by the Natural Language Generation (NLG) community,
though rarely in the context of dialog systems. Two things have
emerged from this work: a “consensus architecture” [17], which
modularizes the large number of tasks performed during NLG in a
particular way, and a range of linguistic representations which can be
used in accomplishing these tasks. Many systems have been built
using NLG technology, including report generators [8, 7], system
description generators [10], and systems that attempt to convince
the user of a particular view through argumentation [20, 4].

* The work reported in this paper was partially funded by DARPA

In this paper, we claim that the work in NLG is relevant to dialog
systems as well. We show how the results can be incorporated,
and report on some initial work in adapting NLG approaches to
dialog systems and their special needs. The dialog system we use is
the AT&T Communicator travel planning system. We use machine
learning and stochastic approaches where hand-crafting appears to
be too complex an option, but we also use insight gained during
previous work on NLG in order to develop models of what should
be learned. In this respect, the work reported in this paper differs
from other recent work on generation in the context of dialog
systems [12, 16], which does not modularize the generation process
and proposes a single stochastic model for the entire process. We
start out by reviewing the generation architecture (Section 2). In
Section 3, we discuss the issue of text planning for Communicator.
In Section 4, we summarize some initial work in using machine
learning for sentence planning [19]. Finally, in Section 5 we
summarize work using stochastic tree models in generation [2].

2.  TEXT  GENERATION  ARCHITECTURE 

NLG  is  conceptualized  as  a  process  leading  from  a  high-level 
communicative  goal  to  a  sequence  of  communicative  acts  which 
accomplish  this  communicative  goal.  A  communicative  goal  is  a 
goal  to  affect  the  user’s  cognitive  state,  e.g.,  his  or  her  beliefs  about 
the  world,  desires  with  respect  to  the  world,  or  intentions  about 
his  or  her  actions  in  the  world.  Following  (at  least)  [13],  it  has 
been  customary  to  divide  the  generation  process  into  three  phases, 
the  first  two  of  which  are  planning  phases.  Reiter  [17]  calls  this 
architecture  a  “consensus  architecture”  in  NLG. 

• During text planning, a high-level communicative goal is
broken down into a structured representation of atomic
communicative goals, i.e., goals that can be attained with a single
communicative act (in language, by uttering a single clause).
The atomic communicative goals may be linked by rhetorical
relations which show how attaining the atomic goals
contributes to attaining the high-level goal.

• During sentence planning, abstract linguistic resources are
chosen to achieve the atomic communicative goals. This
includes choosing meaning-bearing lexemes and deciding how
these lexemes are connected through abstract grammatical
constructions (basically, lexical predicate-argument
structure and modification). As a side-effect, sentence
planning also determines sentence boundaries: there need not
be a one-to-one relation between elementary communicative
goals and sentences in the final text.

• During realization, the abstract linguistic resources chosen
during sentence planning are transformed into a surface
linguistic utterance by adding function words (such as
auxiliaries and determiners), inflecting words, and determining
word order. This phase is not a planning phase in that it only
executes decisions made previously, by using grammatical
information about the target language. (Prosody assignment
can be treated as a separate module which follows realization
and which draws on all previous levels of representation. We
do not discuss prosody further in this paper.)

Figure 1: Architecture of a dialog system with natural language generation

Note that sentence planning and realization use resources specific
to the target language, while text planning is language-independent
(though presumably it is culture-dependent).

In integrating this approach into a dialog system, we see that the
dialog manager (DM) no longer determines surface strings to send
to the text-to-speech (TTS) system, as is often the case in current
dialog systems. Instead, the DM determines high-level communicative
goals which are sent to the NLG component. Figure 1 shows a complete
architecture. An advantage of such an architecture is the possibility
for extended plug-and-play: not only can the entire NLG system be
replaced, but also modules within the NLG system, thus allowing
researchers to optimize the system incrementally.
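
The plug-and-play property can be illustrated with a minimal sketch in which each of the three phases is an interchangeable function. All names here (Goal, generate, the toy stages) are hypothetical and chosen for illustration; they are not part of the Communicator system, and the toy stages only produce clause labels rather than real sentences.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Goal:
    """One atomic communicative goal sent by the dialog manager."""
    act: str         # e.g. "implicit-confirm" or "request"
    attribute: str   # e.g. "orig-city"
    value: str = ""  # empty for requests


# Hypothetical stage signatures; each stage is a plain function so that
# any one of them can be replaced without touching the others.
TextPlanner = Callable[[List[Goal]], List[Goal]]
SentencePlanner = Callable[[List[Goal]], List[str]]
Realizer = Callable[[List[str]], str]


def passthrough_text_planner(goals: List[Goal]) -> List[Goal]:
    # Placeholder: hands the goals on unchanged.
    return list(goals)


def toy_sentence_planner(goals: List[Goal]) -> List[str]:
    # Placeholder: one abstract clause label per goal, no aggregation.
    clauses = []
    for g in goals:
        if g.act == "request":
            clauses.append(f"ask({g.attribute})")
        else:
            clauses.append(f"confirm({g.attribute}={g.value})")
    return clauses


def toy_realizer(clauses: List[str]) -> str:
    # Placeholder: joins clause labels instead of producing real sentences.
    return " ; ".join(clauses)


def generate(goals: List[Goal], text_planner: TextPlanner,
             sentence_planner: SentencePlanner, realizer: Realizer) -> str:
    """Run the three phases in order: text plan, sentence plan, realize."""
    return realizer(sentence_planner(text_planner(goals)))


goals = [Goal("implicit-confirm", "orig-city", "NEWARK"),
         Goal("request", "depart-time")]
print(generate(goals, passthrough_text_planner,
               toy_sentence_planner, toy_realizer))
```

Swapping in a different sentence planner or realizer only requires passing a different function to generate; the dialog manager's output format stays the same.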

The main objection to the use of NLG techniques in dialog systems
is that they require extensive hand-tuning of existing systems
and approaches for new domains. Furthermore, because of
the relative sophistication of NLG techniques as compared to
simpler techniques such as templates, the hand-tuning requires
specialized knowledge of linguistic representations; hand-tuning
templates only requires software engineering skills. An approach based
on machine learning can provide a solution to this problem: it
draws on previous research in NLG and uses the same sophisticated
linguistic representations, but it learns the domain-specific
rules that use these representations automatically from data. It is the
goal of our research to show that for dialog systems, approaches
based on machine learning can do as well as or outperform
hand-crafted approaches (be they NLG- or template-based), while
requiring far less time for tuning. In the following sections, we
summarize the current state of our research on an NLG system for the
Communicator dialog system.


3.  TEXT  PLANNER 

Based on observations from the travel domain of the Communicator
system, we have categorized system responses into two types.
The first type occurs during the initial phase when the system is
gathering information from the user. During this phase, the
high-level communicative goals that the system is trying to achieve are
fairly complex: the goals include getting the hearer to supply
information and explicitly or implicitly confirming information that
the hearer has just supplied. (These latter goals are often motivated
by the still imperfect quality of ASR.) The second type occurs
when the system has obtained information that matches the user’s
requirements and the options (flights, hotels, or car rentals) need to
be presented to the user. Here, the communicative goal is mainly to
make the hearer believe a certain set of facts (perhaps in conjunction
with a request for a choice among these options).

In the past, NLG systems typically have generated reports or
summaries, for which the high-level communicative goal is of the
type “make the hearer/reader believe a given set of facts”, as it is
in the second type of system response discussed above. We believe
that NLG work in text planning can be successfully adapted to
better plan these system responses, taking into account not only the
information to be conveyed but also the dialog context and
knowledge about user preferences. We leave this to ongoing work.

In the first type of system response, the high-level communicative
goal typically is an unordered list of high-level goals, all of
which need to be achieved with the next turn of the system. An
example is shown in Figure 2. NLG work in text planning has not
addressed such complex communicative goals in the past. However,
we have found that for the Communicator domain, no text planning
is needed, and that the sentence planner can act directly on a
representation of the type shown in Figure 2, because the number of
goals is limited (to five, in our studies). We expect that further work
in other dialog domains will require an extension of existing work
in text planning to account better for communicative goals other
than those that simply aim to affect the user’s (hearer’s) beliefs.

implicit-confirm(orig-city: NEWARK)
implicit-confirm(dest-city: DALLAS)
implicit-confirm(month: 9)
implicit-confirm(day-number: 1)
request(depart-time)

Figure 2: Sample text plan (communicative goals)
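
To make the representation in Figure 2 concrete, the following minimal sketch reads a text plan written in that notation into a list of structured goals. The class name CommunicativeGoal, the function parse_goal, and the regular expression are illustrative assumptions; the paper does not specify how the dialog manager encodes its goals internally.

```python
import re
from typing import List, NamedTuple, Optional


class CommunicativeGoal(NamedTuple):
    act: str               # e.g. "implicit-confirm" or "request"
    attribute: str         # e.g. "orig-city"
    value: Optional[str]   # None when the goal carries no value


# act(attribute) or act(attribute: value), tolerant of stray spaces.
_GOAL_RE = re.compile(
    r"^\s*([\w-]+)\(\s*([\w-]+)\s*(?::\s*([^)]*?)\s*)?\)\s*$"
)


def parse_goal(line: str) -> CommunicativeGoal:
    """Parse one line of the Figure 2 notation into a goal."""
    match = _GOAL_RE.match(line)
    if match is None:
        raise ValueError(f"not a communicative goal: {line!r}")
    act, attribute, value = match.groups()
    return CommunicativeGoal(act, attribute, value)


FIGURE_2 = """
implicit-confirm(orig-city: NEWARK)
implicit-confirm(dest-city: DALLAS)
implicit-confirm(month: 9)
implicit-confirm(day-number: 1)
request(depart-time)
"""

plan: List[CommunicativeGoal] = [
    parse_goal(line) for line in FIGURE_2.strip().splitlines()
]
for goal in plan:
    print(goal)
```

Because the goals form an unordered list, nothing in this structure commits the sentence planner to a particular order or grouping of the confirmations and the request.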


Realization                                                   Score

What time would you like to travel on September the 1st       [not legible]
to Dallas from Newark?

Leaving on September the 1st. What time would you like to     [not legible]
travel from Newark to Dallas?

Leaving in September. Leaving on the 1st. What time would     2
you, traveling from Newark to Dallas, like to leave?

Figure 3: Sample alternate realizations of the set of communicative goals shown in Figure 2 suggested by our sentence planner, with
human scores


Figure 4: Architecture of our sentence planner (input: text plan; output: chosen sp-tree with associated DSyntS)


4.  SENTENCE  PLANNER 

The principal challenge facing sentence planning for dialog systems
is that there is no good corpus of naturally occurring interactions
of the type that need to occur between a dialog system and
human users. This is because ASR is not yet perfect, so most or all
of the information provided by the user must be confirmed implicitly
or explicitly. In conversations between two humans,
communicative goals such as implicit or explicit confirmations are
rare, and thus transcripts of human-human interactions in the same
domain cannot be used for the purpose of learning good strategies
to attain communicative goals. And of course we do not want to
use transcripts of existing systems, as we want to improve on their
performance, not mirror it.

We have therefore taken the approach of randomly generating a
set of solutions and having human judges score each of the options.
Each turn of the system is, as described in Section 3, characterized
by a set of high-level goals such as that shown in Figure 2. In the
turns we consider, no text planning is needed. To date, we have
concentrated on the issue of choosing abstract syntactic constructions
(rather than lexical choice), so we map each elementary
communicative goal to a canonical lexico-syntactic structure (called a
DSyntS [11]). We then randomly combine these DSyntSs into
larger DSyntSs using a set of clause-combining operations
identified previously in the literature [14, 18, 5], such as
RELATIVE-CLAUSE, CONJUNCTION, and MERGE.¹ The way in which the
elementary DSyntSs are combined is represented in a structure called
the sp-tree. Each sp-tree is then realized using an off-the-shelf
realizer, RealPro [9]. Some sample realizations for the same text plan
are shown in Figure 3, along with the average of the scores assigned
by two human judges.

¹ MERGE identifies the verbs and arguments of two lexico-syntactic
structures which differ only in adjuncts. For example, “you are flying
from Newark” and “you are flying on Monday” are merged to “you are
flying from Newark on Monday”.
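
The MERGE operation can be sketched on a deliberately simplified clause representation. The Clause class below is a hypothetical stand-in with a subject, a verb, and a tuple of adjuncts; a real DSyntS is a dependency tree, so this shows only the idea of identifying the shared verb and arguments and pooling the adjuncts.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Clause:
    """Simplified stand-in for a lexico-syntactic structure."""
    subject: str
    verb: str
    adjuncts: Tuple[str, ...] = ()


def merge(a: Clause, b: Clause) -> Optional[Clause]:
    """Combine two clauses that differ only in their adjuncts.

    Returns None when the verb or its argument differ, in which case
    MERGE does not apply.
    """
    if (a.subject, a.verb) != (b.subject, b.verb):
        return None
    # Keep a's adjuncts and append those of b that are not already there.
    extra = tuple(adj for adj in b.adjuncts if adj not in a.adjuncts)
    return Clause(a.subject, a.verb, a.adjuncts + extra)


def render(clause: Clause) -> str:
    # Naive linearization, for display only.
    return " ".join((clause.subject, clause.verb) + clause.adjuncts)


merged = merge(
    Clause("you", "are flying", ("from Newark",)),
    Clause("you", "are flying", ("on Monday",)),
)
print(render(merged))
```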


Using  the  human  scores  on  each  of  the  up  to  twenty  variants  per 
turn,  we  use  RankBoost  [6]  to  learn  a  scoring  function  which  uses 
a  large  set  of  syntactic  and  lexical  features.  The  resulting  sentence 
planner  consists  of  two  components:  the  sentence  plan  generator 
(SPG)  which  generates  candidate  sentence  plans  and  the  sentence 
plan  ranker  (SPR)  which  scores  each  one  of  them  using  the  rules 
learned  by  RankBoost  and  which  then  chooses  the  best  sentence 
plan.  This  architecture  is  shown  in  Figure  4. 
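
The generate-and-rank design can be sketched as follows. This is a minimal illustration under strong assumptions: the candidates are the three realizations of Figure 3 rather than output of a real SPG, the features are simple string counts rather than the syntactic and lexical features of sp-trees, and the rule weights are invented for the example, not learned by RankBoost. Only the overall shape, a score formed as a sum of weights of threshold rules followed by choosing the top candidate, is meant to carry over.

```python
from typing import Dict, List, Sequence, Tuple


def extract_features(realization: str) -> Dict[str, float]:
    """Toy surface features; the real SPR uses sp-tree and DSyntS features."""
    return {
        "num_sentences": float(sum(realization.count(p) for p in ".?!")),
        "num_commas": float(realization.count(",")),
    }


# (feature name, threshold, weight): the weight is added to the score
# when the feature value reaches the threshold. Weights are made up.
Rule = Tuple[str, float, float]
HYPOTHETICAL_RULES: List[Rule] = [
    ("num_sentences", 2.0, -0.5),
    ("num_sentences", 3.0, -1.0),
    ("num_commas", 1.0, -0.7),
]


def score(realization: str,
          rules: Sequence[Rule] = HYPOTHETICAL_RULES) -> float:
    """Sum the weights of all rules whose threshold is reached."""
    feats = extract_features(realization)
    return sum(w for name, threshold, w in rules if feats[name] >= threshold)


def choose_best(candidates: Sequence[str]) -> str:
    """Ranker step: return the highest-scoring candidate."""
    return max(candidates, key=score)


CANDIDATES = [
    "What time would you like to travel on September the 1st "
    "to Dallas from Newark?",
    "Leaving on September the 1st. What time would you like to "
    "travel from Newark to Dallas?",
    "Leaving in September. Leaving on the 1st. What time would you, "
    "traveling from Newark to Dallas, like to leave?",
]
print(choose_best(CANDIDATES))
```

Keeping generation and ranking separate means the ranker can be retrained on new judgments without changing how candidates are produced.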

We compared the performance of our sentence planner to a random
choice of sentence plans, and to the sentence plans chosen
as top-ranked by the human judges. The mean score of the turns
judged best by the human judges is 4.82 as compared with the
mean of 4.56 for the turns generated by our sentence planner, for
a mean difference of 0.26 (5%) on a scale of 1 to 5. The mean of
the scores of the turns picked randomly is 2.76, for a mean difference
of 1.8 (36%) relative to our sentence planner. We validated these
results in an independent
experiment in which 60 subjects evaluated different realizations for
a given turn [15]. (Recall that our trainable sentence planner was
trained on the scores of only two human judges.) This evaluation
revealed that the choices made by our trainable sentence planner
were not statistically distinguishable from the choices ranked at the
top by the two human judges. More importantly, they were also not
distinguishable statistically from the current hand-crafted
template-based output of the AT&T Communicator system, which has been
developed and fine-tuned over an extended period of time (the
trainable sentence planner is based on judgments that took about three
person-days to make).

5.  REALIZER 

At the level of the surface language, the difference in communicative
intention between human-human travel advisory dialogs
and the intended dialogs is not as relevant: we can try to mimic
the human-human transcripts as closely as possible. To show this,
we have performed some initial experiments using FERGUS (Flexible
Empiricist-Rationalist Generation Using Syntax), a stochastic
surface realizer which incorporates a tree model and a linear language
model [2]. We have developed a metric which can be computed
automatically from the syntactic dependency structure of the
sentence and the linear order chosen by the realizer, and we have
shown that this metric correlates with human judgments of the
felicity of the sentence [3]. Using this metric, we have shown that the
use of both the tree model and the linear language model improves
the quality of the output of FERGUS over the use of only one or
the other of these resources.

FERGUS was originally trained on the Penn Treebank corpus
consisting of Wall Street Journal text (WSJ). The results on
an initial set of Communicator sentences were not encouraging,
presumably because there are few questions in the WSJ corpus,
and furthermore, specific constructions (including “what” as a
determiner) appear to be completely absent (perhaps due to a newspaper
style file). In an initial experiment, we replaced the linear language
model (LM) trained on 1 million words of WSJ by an LM trained
on 10,000 words of human-human travel planning dialogs collected
at CMU. This resulted in a dramatic improvement, with almost all
questions being generated correctly. Since the CMU corpus is
relatively small for an LM, we intend to experiment with finding the
ideal combination of WSJ and CMU corpora. Furthermore, we are
currently in the process of syntactically annotating the CMU corpus
so that we can derive a tree model as well. We expect further
improvements in quality of the output, and we expect to be able
to exploit the kind of limited lexical variation allowed by the tree
model [1].
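
One common way to combine two corpora in a language model is linear interpolation, with the mixing weight chosen to maximize the likelihood of held-out in-domain text. The sketch below illustrates that idea only; the paper does not say which combination method is intended, and the unigram models, add-one smoothing, toy corpora, and function names are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Set


def unigram_lm(tokens: List[str], vocab: Set[str],
               alpha: float = 1.0) -> Dict[str, float]:
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(p_a: Dict[str, float], p_b: Dict[str, float],
                lam: float) -> Dict[str, float]:
    """Mix two models over the same vocabulary: lam*p_a + (1-lam)*p_b."""
    return {w: lam * p_a[w] + (1.0 - lam) * p_b[w] for w in p_a}


def log_likelihood(lm: Dict[str, float], tokens: Iterable[str]) -> float:
    return sum(math.log(lm[w]) for w in tokens)


def best_lambda(p_a: Dict[str, float], p_b: Dict[str, float],
                heldout: List[str], grid: List[float]) -> float:
    """Grid search for the weight on p_a that best fits held-out text."""
    return max(
        grid,
        key=lambda lam: log_likelihood(interpolate(p_a, p_b, lam), heldout),
    )


# Toy stand-ins for the two corpora (not real WSJ or CMU data).
wsj_like = "the market rose the market fell".split()
travel_like = "what time would you like to leave".split()
heldout = "what time would you like".split()

vocab = set(wsj_like) | set(travel_like)
p_wsj = unigram_lm(wsj_like, vocab)
p_travel = unigram_lm(travel_like, vocab)

weight_on_wsj = best_lambda(p_wsj, p_travel, heldout,
                            [0.0, 0.25, 0.5, 0.75, 1.0])
print(weight_on_wsj)
```

With realistic corpora the held-out text would contain words seen only in the larger corpus, which is where a nonzero weight on the out-of-domain model can help.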

6.  CONCLUSION 

We have discussed how work in NLG can be applied in the
development of dialog systems, and we have presented two
approaches to using stochastic models and machine learning in NLG.

Of course, the final justification for using a more sophisticated NLG
architecture must come from user trials of an integrated system.
However, we suspect that, as in the case of non-dialog NLG
systems, the strongest arguments in favor of NLG often come from
software engineering issues of maintainability and extensibility, which
can be difficult to quantify in research systems.